Statistical mechanics of lexicon learning in an uncertain and nonuniform world

We study the time taken by a language learner to correctly identify the meaning of all words 
in a lexicon under conditions where many plausible meanings can be inferred whenever a word is 
uttered. We show that the most basic form of cross-situational learning — whereby information from 
multiple episodes is combined to eliminate incorrect meanings — can perform badly when words are 
learned independently and meanings are drawn from a nonuniform distribution. If learners further 
assume that no two words share a common meaning, we find a phase transition between a maximally 
efficient learning regime, where the learning time is reduced to the shortest it can possibly be, and 
a partially-efficient regime where incorrect candidate meanings for words persist at late times. We 
obtain exact results for the word-learning process through an equivalence to a statistical mechanical 
problem of enumerating loops in the space of word-meaning mappings. 



On average, children learn ten words a day, thereby 
amassing a lexicon of 60,000 words by adulthood [1]. This 
speed of learning is remarkable given that every time a 
speaker says a word, a hearer cannot be certain of its 
intended meaning [3] . Our aim is to identify which of the 
many proposed mechanisms for eliminating uncertainty 
actually deliver such rapid word learning. In this work, 
we demonstrate that important insights into this question 
can be gained by using the tools of statistical physics to 
analyse stochastic models for word learning. 

Empirical research suggests that two basic types of 
learning mechanism are involved in word learning. First, 
a learner can apply various heuristics at the moment a 
word is produced to reduce the size of the set of plausible 
meanings. Such heuristics include attention to gaze 
direction [3] and prior experience of language structure. 
However, these heuristics may leave some residual 
uncertainty as to a word's intended meaning in a sin- 
gle instance of use. If the heuristics are weak, the set 
of candidate meanings remains very large. This residual 
uncertainty can be eliminated by comparing separate in- 
stances of a word's use: if only one meaning is plausible 
across all such instances, it is a very strong candidate for 
the word's intended meaning. This second mechanism 
is referred to as cross-situational learning [HI, Q . There 
is little consensus as to which mechanisms are the most 
important for word learning in the real world. 

Initial studies of word-learning models [12-15] show 
that experimentally-inspired cross-situational learning 
strategies [1, HB, [13) acting with limited assistance from 
heuristics, can reproduce the rapid learning seen in chil- 
dren. In these models, a key control parameter is the 
context size: the number of plausible, but unintended, 
meanings that typically accompany a single word. Rapid learning is possible in the models of [12-15] even when contexts are large, suggesting that powerful heuristics, capable of filtering out large numbers of spurious meanings, are not required. A recent simulation study [18], however, shows that this conclusion relies on the assumption that these unintended meanings are uniformly distributed. In the more realistic scenario where different meanings are inferred with different probabilities, word learning rates can decrease dramatically as context sizes increase. Powerful heuristics may be necessary after all. 

In this work, we present an exact solution of a sta- 
tistical mechanical model for word learning that shows 
that one particular heuristic — a mutual exclusivity con- 
straint — working alongside cross-situational learning can 
achieve the maximum possible learning rate. A learner 
applying this constraint, which is readily apparent in 
children [19], assumes that no two words have the same 
meaning. From a statistical mechanical perspective, this 
constraint induces complex interactions between words 
that were absent in previous models. Our solution re- 
veals a phase transition at a critical context size between 
a regime in which mutual exclusivity speeds up learning 
to some degree, and a regime in which an entire lexicon 
is learnt in the same amount of time that is needed for 
each word to have been heard at least once. Since a lexi- 
con cannot be learnt any faster than this, learning in this 
regime is maximally efficient. In this model, the context 
size is determined by heuristics other than mutual ex- 
clusivity. Our findings show that rapid word learning is 
possible even when these other heuristics are weak. 

We begin by defining our model for lexicon learning. The lexicon comprises $W$ words, and each word $i$ is uttered as a Poisson process with rate $\phi_i$. In all cases, we take words to be produced according to the Zipf distribution, $\phi_i = 1/(\mu i)$, that applies for the ~$10^4$ most frequent words in English [20-22]. Here, $\mu = \sum_{i=1}^{W} (1/i)$ so that one word appears on average per unit time. Each time a word $i$ is presented, the intended target meaning is assumed always to be inferred by the learner by applying some heuristics. At the same time, a set of non-target confounding meanings, called the context, is also inferred. 

FIG. 1. Acquisition of a three-word lexicon. Solid shapes are those that have appeared in every episode alongside a word; open shapes are therefore excluded as candidate meanings. (a) In the noninteracting case, only the meaning of 'square' is learned. (b) In the interacting case, mutual exclusivity further removes meanings of learned words (shown hatched), both prospectively and retrospectively (shown by arrows). All three words are learned in this example. 

In the purest version of cross-situational learning 
0, a learner assumes that all meanings that have 
appeared every time a word has been uttered are plausi- 
ble candidate meanings for that word. The word becomes 
learned when the target is the only meaning to have ap- 
peared in each episode. In the noninteracting case, each 
word is learned independently; see Fig. 1(a). In the interacting 
case, mutual exclusivity acts to further exclude 
the meanings of learned words as candidates for other 
words. We take this exclusion to occur at the instant 
a word is learned, which means a single learning event 
may trigger an avalanche of other learning events by re- 
peated application of mutual exclusivity. An example of 
this nontrivial effect is shown in Fig. 1(b). Here, learning 
"square" causes "circle" to be learned at the same time. 

We consider the noninteracting case first, in part to illustrate our analytical methods, but also to identify the origin of the catastrophic increase in learning times noted in [18]. Two conditions must be satisfied for the lexicon to be learned by a given time: (C1) all words must have been exposed at least once; and (C2) no confounding meaning may have appeared in every episode that any given word was uttered. To express these conditions mathematically, we introduce two stochastic indicator variables. We take $E_i(t) = 1$ if word $i$ has been uttered before time $t$, and zero otherwise; and $A_{i,j}(t) = 1$ if confounding meaning $j$ has appeared in every context alongside word $i$ up to time $t$ (or if word $i$ has never been presented), and zero otherwise. Conditions (C1) and (C2) then imply that the probability that the lexicon has been learned by time $t$ is 

$$ L(t) = \left\langle \prod_{i=1}^{W} E_i(t) \prod_{j \ne i} \left[ 1 - A_{i,j}(t) \right] \right\rangle = \left\langle \prod_{i=1}^{W} \prod_{j \ne i} \left[ 1 - A_{i,j}(t) \right] \right\rangle , \tag{1} $$

where the angle brackets denote an average over all sequences of episodes that may occur up to time $t$. The second equality holds because $A_{i,j}(t) = 1 \;\forall j \ne i$ if $E_i(t) = 0$. 
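
As an illustration of conditions (C1) and (C2), the following Python sketch tracks, for each word, the set of meanings that have appeared in every episode so far. The episode list, the three-word lexicon and the function names are hypothetical choices made for this example; they are not the simulation code behind the results reported here.

```python
def candidate_sets(episodes):
    """Intersect, word by word, the meanings seen in every episode of that word.

    `episodes` is a list of (word, meanings) pairs, where `meanings` holds the
    target meaning plus the context of confounders inferred in that episode.
    """
    candidates = {}
    for word, meanings in episodes:
        seen = set(meanings)
        if word in candidates:
            candidates[word] &= seen   # keep only meanings present every time
        else:
            candidates[word] = seen    # first exposure: everything is plausible
    return candidates


def lexicon_learned(episodes, lexicon):
    """Check (C1) and (C2) for independently learned words."""
    candidates = candidate_sets(episodes)
    for word, target in lexicon.items():
        if word not in candidates:          # (C1) fails: word never uttered
            return False
        if candidates[word] != {target}:    # (C2) fails: a confounder persists
            return False
    return True


# Hypothetical three-word lexicon in which each meaning shares its word's name.
lexicon = {"square": "square", "circle": "circle", "star": "star"}
episodes = [
    ("square", {"square", "circle"}),
    ("circle", {"circle", "square"}),
    ("square", {"square", "star"}),    # 'circle' is now excluded for 'square'
    ("circle", {"circle", "square"}),  # 'square' still confounds 'circle'
    ("star", {"star", "square"}),      # heard once: 'square' remains plausible
]
print(candidate_sets(episodes))
print(lexicon_learned(episodes, lexicon))
```

In this toy sequence only 'square' is reduced to a single candidate, so the lexicon as a whole is not yet learned when words are treated independently.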

This expression is valid for any distribution over contexts. For brevity, we consider a single, highly illustrative construction that we call resampled Zipf (RZ). It is based on the idea that meaning frequencies should follow a similar distribution to word forms [18]. In this approach, a set of $M$ confounding meanings, $\mathcal{M}_i$, is attached to each word $i$, and each meaning $k$ is ascribed an a priori probability $w_k = 1/(\zeta k)$, where $\zeta$ is a normalization. Whenever word $i$ appears, meanings are repeatedly sampled from $\mathcal{M}_i$ with their a priori probabilities, and added to the context if they are not already present, until a context of $C$ distinct meanings has been constructed. When words are learned independently, the learning time depends only on $M$, $W$ and $C$, and not on which meanings are present in each set $\mathcal{M}_i$ [13]. 
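
The RZ construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: meanings are identified by their rank $k = 1, \ldots, M$, the a priori weights are taken proportional to $1/k$, and the function names and parameter values are hypothetical rather than those used for the reported simulations.

```python
import random


def zipf_weights(M):
    """A priori probabilities w_k proportional to 1/k for ranks k = 1..M."""
    norm = sum(1.0 / k for k in range(1, M + 1))
    return [1.0 / (norm * k) for k in range(1, M + 1)]


def sample_rz_context(M, C, rng):
    """Draw with replacement until C distinct confounders have been collected."""
    if not 0 <= C <= M:
        raise ValueError("context size C must lie between 0 and M")
    weights = zipf_weights(M)
    ranks = range(1, M + 1)
    context = set()
    while len(context) < C:
        (k,) = rng.choices(ranks, weights=weights)  # one weighted draw
        context.add(k)                               # repeated draws are ignored
    return context


def appearance_frequency(M, C, k, n_samples, rng):
    """Monte Carlo estimate of how often the rank-k confounder is in a context."""
    hits = sum(k in sample_rz_context(M, C, rng) for _ in range(n_samples))
    return hits / n_samples


rng = random.Random(0)
print(sample_rz_context(M=50, C=5, rng=rng))
print(appearance_frequency(M=50, C=5, k=1, n_samples=2000, rng=rng))
```

The last line estimates the appearance frequency of the most common confounder by sampling, which is the only route available when no closed form for that frequency is known.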

We seek the time, $t^*$, at which the lexicon is learned with some high probability $1 - \epsilon$. In the RZ model, each context is an independent sample from a fixed distribution. Hence, the correlation functions $\langle A_{i,j_1} A_{i,j_2} \cdots \rangle$ in (1) all decay exponentially in time. To find $t^*$ to good accuracy in the small-$\epsilon$ limit, only the slowest decay mode for each word $i$ is needed. Higher-order correlation functions depend on many meanings co-occurring, and so decay more rapidly than lower-order correlation functions. As shown in Appendix A, we find that at late times (1) is well approximated by 

$$ L(t) \sim \prod_{i=1}^{W} \left[ 1 - e^{-\phi_i (1 - a_i^*) t} \right] , \tag{2} $$

where $a_i^*$ is the fraction of episodes in which word $i$'s most 
frequent confounder appears alongside the target. This 
extends previous results for independently-learned words 
[l^ - [T^ to arbitrary nonuniform confounder distributions. 

The RZ model has the further simplification that $a_i^*$ has a common value, $a^*$, for all words $i$. Then, it is known from previous calculations [13] for Zipfian-distributed word frequencies that the learning time is 

$$ t^* \approx \frac{\mu W}{1 - a^*} \, W_0\!\left( - \frac{W}{\ln(1 - \epsilon)} \right) , \tag{3} $$

where $W_0(z)$ is the principal branch of the Lambert $W$ function [23]. For large argument, this function behaves as a logarithm. 
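
For readers who wish to evaluate a learning time of this kind numerically, the sketch below implements the principal branch of the Lambert $W$ function by Newton iteration and applies it to the expression $t^* \approx \mu W (1 - a^*)^{-1} W_0(-W/\ln(1 - \epsilon))$. That expression is an assumption here: it is the form obtained by solving the late-time approximation of Appendix A for $\ln L(t) = \ln(1 - \epsilon)$. The function names are illustrative only.

```python
import math


def lambert_w0(z, tol=1e-12, max_iter=100):
    """Principal branch W_0(z) for z >= 0, by Newton iteration on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starting guess at or above the root; exact at z = 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def harmonic(W):
    """mu = sum_{i=1}^{W} 1/i, the normalization of the Zipf word frequencies."""
    return sum(1.0 / i for i in range(1, W + 1))


def learning_time_independent(W, a_star, eps):
    """Assumed form: t* = mu * W / (1 - a*) * W_0(-W / ln(1 - eps))."""
    mu = harmonic(W)
    return mu * W / (1.0 - a_star) * lambert_w0(-W / math.log1p(-eps))


for a_star in (0.0, 0.5, 0.9):
    print(a_star, learning_time_independent(W=1000, a_star=a_star, eps=0.01))
```

Because $a^*$ enters only through the prefactor $1/(1 - a^*)$, the learning time diverges as the most common confounder approaches certainty of appearing, while the dependence on $W$ and $\epsilon$ inside $W_0$ is only logarithmic.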

In Fig. 2 we compare the analytical result (3) with learning times obtained from direct Monte Carlo simulations, conducted as detailed in [13]. The only complication is that we unfortunately have no analytic expression for $a^*$ arising from the RZ procedure. We therefore obtain the frequency of the most common confounder for given $C$ and $M$ from independent Monte Carlo samples. The agreement between (3) and simulation is very good. 

FIG. 2. Time to learn a lexicon of $W$ words independently to a residual probability $\epsilon = 0.01$ with $C$ of $M$ confounders present in each episode. Points: data from Monte Carlo simulations (over 10,000 sampled lexicons in each case). Lines: the analytical result, Eq. (3). (Horizontal axis: context size, $C$.) 

Fig. 2 also shows that the learning time increases super-exponentially with the context size. We have found that the probability that the $k$th most frequent confounder appears in a context of size $C$ fits the form $p_k \approx 1 - (1 - w_k)^{C^{\lambda}}$, where $w_k$ is the a priori probability and $\lambda$ is a fitting parameter that depends on $M$ and $k$. As noted by Vogt [18], the repeated sampling without replacement implies that $p_k > 1 - (1 - w_k)^{C}$. Our analysis further reveals that 
the learning time is entirely determined by the frequency 
of the most common confounder, a*. Although the RZ 
model has the peculiarity that the second-most frequent 
meaning can be made arbitrarily less likely to appear 
than the first by increasing C (a property we will exploit 
below), this is not in itself required for the learning time 
to be dictated by a*. Indeed, the asymptotic formula (3) 
applies for C < 5 where the non-appearance frequencies 
are within a factor of two of one another. 

We now turn to the case where the mutual exclusivity constraint serves to exclude the meanings of learned words as possible meanings for other words. In this case, it is important to distinguish between labeled and unlabeled meanings: an unlabeled meaning is not the target meaning of any word in the lexicon, and hence cannot be excluded using the mutual exclusivity constraint. To arrive at the analog of Eq. (1) for this problem, we must identify the conditions for the lexicon to be learned. Condition (C1) still applies: each word must be uttered at least once for a learner to be able to learn it. Condition (C2) now applies only to unlabeled confounding meanings: these can only be excluded if they fail to appear in a context, as before. When these two conditions are satisfied, there is a third condition, both necessary and sufficient, for the lexicon to be learned that takes into account all the interactions and avalanches generated by the mutual exclusivity constraint. This is condition (C3): no candidate loops exist at time $t$. A candidate loop, $\ell = (i_1, i_2, \ldots, i_n)$, is a subset of distinct, labeled meanings whereby each meaning $i_k$ has appeared alongside word $i_{k-1}$ (or word $i_n$ if $k = 1$) every time it has been uttered. Inspection of Fig. 1(b) shows that the one candidate loop (■, ●) that exists after the third episode is destroyed in the fourth. Then, in the fifth episode, the final word appears, and since no unlabeled meaning is a candidate for any word, the entire three-word lexicon is learned. 

To see why condition (C3) is necessary and sufficient in general when (C1) and (C2) hold, we first show that a candidate loop must exist if the lexicon has not been learned. Suppose word $i_1$ has not been learned. Then, at least one meaning, $i_2$, must confound word $i_1$. Word $i_2$ must also not have been learned, otherwise meaning $i_2$ would not confound word $i_1$. Hence, word $i_2$ must be confounded by a meaning, $i_3$, and so on. As there is a finite set of words, this sequence of meanings must eventually form a loop. 

We now show the lexicon cannot have been learned if a candidate loop exists by first assuming that it has been learned under these conditions. Then, if word $i_1$ was learned at time $t$, word $i_2$ must have been learned before time $t$ for mutual exclusivity to act (even if words $i_1$ and $i_2$ are learned as part of the same avalanche). Iterating this argument around the loop, one finds that word $i_1$ can only have become learned at time $t$ if it had already been learned at some earlier time. This contradiction therefore implies that the absence of candidate loops and a learned lexicon are equivalent. 
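
The equivalence argued above can be checked on small examples. The Python sketch below is illustrative only: it assumes that conditions (C1) and (C2) already hold (every word has been heard and there are no unlabeled meanings), represents each word's surviving candidates as a set, and uses hypothetical function names. One function applies mutual exclusivity repeatedly, as in an avalanche; the other searches for a candidate loop as a directed cycle among words.

```python
def apply_mutual_exclusivity(candidates, lexicon):
    """Repeatedly strip meanings of learned words from other words' candidates.

    `candidates[word]` is the set of meanings seen in every episode of `word`;
    `lexicon[word]` is its target meaning. Returns the set of learned words.
    """
    remaining = {w: set(c) for w, c in candidates.items()}
    learned = set()
    changed = True
    while changed:                       # each pass may trigger further learning
        changed = False
        for word, cands in remaining.items():
            if word not in learned and cands == {lexicon[word]}:
                learned.add(word)
                changed = True
                for other, other_cands in remaining.items():
                    if other != word:
                        other_cands.discard(lexicon[word])
    return learned


def has_candidate_loop(candidates, lexicon):
    """Detect a directed cycle in the graph: word -> word whose meaning confounds it."""
    word_of = {meaning: word for word, meaning in lexicon.items()}
    edges = {
        w: [word_of[m] for m in c if m in word_of and word_of[m] != w]
        for w, c in candidates.items()
    }
    state = {}                           # 1 = on current DFS path, 2 = finished

    def visit(w):
        state[w] = 1
        for nxt in edges.get(w, []):
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[w] = 2
        return False

    return any(w not in state and visit(w) for w in edges)


lexicon = {"square": "square", "circle": "circle", "star": "star"}
no_loop = {"square": {"square"}, "circle": {"circle", "square"},
           "star": {"star", "square"}}
with_loop = {"square": {"square", "circle"}, "circle": {"circle", "square"},
             "star": {"star"}}
for cands in (no_loop, with_loop):
    learned = apply_mutual_exclusivity(cands, lexicon)
    print(sorted(learned), has_candidate_loop(cands, lexicon))
```

In the first configuration learning 'square' releases the other two words in a single avalanche and no loop is found; in the second, 'square' and 'circle' confound each other and neither can be learned.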

We again use indicator variables to translate conditions (C1)-(C3) into an exact expression for the learning probability. Introducing $C_\ell(t) = A_{i_1,i_2}(t) A_{i_2,i_3}(t) \cdots A_{i_n,i_1}(t)$ that equals 1 if the loop $\ell$ persists at time $t$, we have 

$$ L(t) = \left\langle \prod_{i=1}^{W} \left\{ E_i(t) \prod_{j > W} \left[ 1 - A_{i,j}(t) \right] \right\} \prod_{\ell} \left[ 1 - C_\ell(t) \right] \right\rangle , \tag{4} $$

again valid for any distribution of confounding meanings. Here, meanings 1 to $W$ correspond to words 1 to $W$, and so meanings with an index $j > W$ are unlabeled. The product over $\ell$ is over all possible candidate loops. This expression has the remarkable property that it is expressed concisely in terms of the word and confounder appearance frequencies alone: the avalanche dynamics triggered by mutual exclusivity do not enter explicitly. This property, reminiscent of the avalanche dynamics of Abelian sandpile models [24], reduces analysis of the learning probability to the statistical mechanical problem of enumerating candidate loops. 

In the interacting problem, the structure of each candidate set $\mathcal{M}_i$ is important, as this determines which words interact. We consider a model which has no unlabeled meanings and where each set $\mathcal{M}_i$ is a sample of $M$ non-target meanings obtained via the RZ prescription. Then, in each episode, $C$ meanings are drawn from the relevant candidate set using RZ again, but with an a priori probability $1/(\zeta k)$ where $k$ is the rank of a meaning within the set $\mathcal{M}_i$ when ordered by the frequency of the corresponding words. Thus meanings of high-frequency words are high-frequency confounders. Learning times from Monte Carlo simulations are shown in Fig. 3. 

FIG. 3. As Fig. 2 but with the mutual exclusivity constraint. Points: data from Monte Carlo simulations (100,000 lexicons for $C < 20$, at least 2,500 lexicons for larger $C$). Dotted lines: time for the entire lexicon to have been exposed with residual probability $\epsilon = 0.01$. Dashed lines: time for the slowest decaying candidate loop to remain with probability $\epsilon$. Solid line: time to learn lexicon independently, Eq. (3), for comparison. 

We observe two distinct learning-time regimes. At 
small C, the learning time is constant, and close to the 
time it takes for all words in the lexicon to appear at least 
once. (This time is given by Eq. (3) with a* = 0.) In this 
regime, learning is as fast as it can possibly be: mutual 
exclusivity is maximally efficient and reverses the unde- 
sirable increase in learning times that arises from nonuni- 
form confounder distributions. Above a critical context 
size, the learning time rises, but remains much smaller 
than when words are learned independently: mutual ex- 
clusivity is partially efficient in this regime. 

Our exact result (4) can be used to demonstrate a phase transition between these two regimes. As noted above, the most frequent meanings will almost always appear in any given context when $C$ is large due to the RZ procedure. However, the ratio of their non-appearance frequencies can be made arbitrarily large. This implies that the most frequent non-target meaning is almost always in the set $\mathcal{M}_i$, and that this confounder has by far the slowest decay. Then, as shown in Appendix B, 

$$ L(t) \sim \exp\!\left( - \frac{\mu W^2}{t} \, e^{-t/(\mu W)} \right) \left[ 1 - \exp\!\left( - \frac{3 (1 - a^*) t}{2 \mu} \right) \right] , \tag{5} $$

because the only confounder loop that is relevant at late times is $\ell = (1, 2)$. Here, the first exponential factor gives the probability that all words have appeared by time $t$. If $3(1 - a^*)/2 > 1/W$, then, at the time this probability equals $1 - \epsilon$, the correction from the candidate loop is vanishingly small in the limit $\epsilon \to 0$. Hence in this limit, we expect a phase transition from the maximally-efficient regime to a regime in which the second term in (5) is dominant, and the lexicon learning time is given by $t^* = - 2 \mu \ln \epsilon / [3 (1 - a^*)]$. This time is plotted in Fig. 3, and agrees very well with the simulation data in the partially-efficient regime. We note that for a lexicon of 60,000 words, partially-efficient learning occurs only when each word's most frequent confounder fails to appear in less than 0.001% of all episodes. Even then, words are learned over $W$ times faster than when learned independently. These facts further highlight the incredible power mutual exclusivity has in driving down learning times. 

We have investigated a range of models in addition to the one described here, details of which will appear elsewhere [25]. The phase transition between a maximally- and partially-efficient regime appears always to be present. Thus generically, mutual exclusivity can effect huge reductions in learning times in a variety of model systems, lessening the burden on other word-learning heuristics. The effect is particularly strong in the RZ model because nearly all contexts contain the most frequent word's meaning. We have nevertheless observed the phase transition in models where many candidate loops enter into the late-time dynamics, in models where confounder frequencies are uncorrelated with their corresponding word frequencies, and in weaker versions of cross-situational learning where learners adopt concrete hypotheses for the meaning of words as they are presented, as opposed to waiting for all uncertainty to be eliminated [2^. We also expect the transition to be evident in models where the target meaning does not always appear, at least in the regime where learning is possible [3 EBl • Finally, we note that although we have focused squarely on the learning time in this work, there are other properties (e.g., the distribution of learning times for word $i$) that can be obtained from generalizations of Eq. (4), and may shed light on such phenomena as the childhood vocabulary explosion at around 18 months [26]. We believe the methods introduced in this work should be applicable to many, if not all, of these problems. Furthermore, our results suggest new empirical questions, such as whether high-frequency confounders correlate with high-frequency words, and the extent to which learners are able to apply the mutual-exclusivity constraint retroactively. We therefore contend that statistical physicists can contribute much to the understanding of how children learn the meaning of words. 

Acknowledgments: We thank Mike Cates and Cait MacPhee for comments on the manuscript. 



Appendix A: Learning time in the noninteracting 
case 

In the main text, we derived a formula, given there as Eq. (1), for the probability $L(t)$ that a lexicon of words is learned by time $t$ if they are learned independently by cross-situational learning. This read 

$$ L(t) = \left\langle \prod_{i=1}^{W} E_i(t) \prod_{j \ne i} \left[ 1 - A_{i,j}(t) \right] \right\rangle , \tag{A1} $$

where $E_i(t) = 1$ only if word $i$ has been presented by time $t$, and $A_{i,j}(t)$ is 1 if word $i$ has never been presented, or if, in every presentation up to time $t$, the confounding meaning $j \ne i$ has always appeared alongside. The angle brackets denote an average over all possible exposure sequences. Under other conditions, these indicator variables are zero. 

This equation was also presented in an alternative form which follows from the fact that $E_i(t) = 0$ implies that $A_{i,j}(t) = 1$ for all $j \ne i$. Hence, for all allowed combinations of $E_i(t)$ and $A_{i,j}(t)$, we have the identity $[1 - E_i(t)] A_{i,j}(t) = [1 - E_i(t)]$, which can be rearranged to obtain $E_i(t) [1 - A_{i,j}(t)] = [1 - A_{i,j}(t)]$. Assuming that there is at least one confounding meaning for each word, the $E_i(t)$ variables in the above equation are then redundant, and the more concise form 

$$ L(t) = \left\langle \prod_{i=1}^{W} \prod_{j \ne i} \left[ 1 - A_{i,j}(t) \right] \right\rangle \tag{A2} $$

then applies. 

In the main text, we discussed models where contexts of confounding meanings were independently sampled from distributions that may be word-dependent, but remain fixed over time. In particular, this implies that the contexts appearing against different words are independent, and we have factorization of the average into word-dependent factors: 

$$ L(t) = \prod_{i=1}^{W} \left\langle \prod_{j \ne i} \left[ 1 - A_{i,j}(t) \right] \right\rangle . \tag{A3} $$

Since the confounder distributions are fixed, we find after $n_i$ presentations of word $i$ that 

$$ A_{i,j_1} A_{i,j_2} \cdots A_{i,j_k} = \begin{cases} 1 & \text{with prob. } a_i(j_1, \ldots, j_k)^{n_i} \\ 0 & \text{otherwise} \end{cases} \tag{A4} $$

where $a_i(j_1, \ldots, j_k)$ is the joint probability that all $k$ meanings $j_1, j_2, j_3, \ldots, j_k$ appear in a single episode. Since word $i$ is presented as a Poisson process with frequency $\phi_i$, we find that 

$$ \langle A_{i,j_1} A_{i,j_2} \cdots A_{i,j_k} \rangle = \sum_{n_i = 0}^{\infty} \frac{(\phi_i t)^{n_i}}{n_i!} \, a_i(j_1, \ldots, j_k)^{n_i} \, e^{-\phi_i t} = e^{-\phi_i [1 - a_i(j_1, \ldots, j_k)] t} . \tag{A5} $$

Therefore, on multiplying out the average in (A3), we find a sum of exponential decays. We are interested in the slowest decay mode, which corresponds to the highest possible value of $a_i(j_1, \ldots, j_k)$ among all possible sets of confounding meanings. As noted in the main text, any combination of meanings $j_1, j_2, \ldots, j_k$ cannot appear more frequently than the least frequent meaning among that subset. If, for each word, the individual meaning frequencies $a_i(j)$ are distinct for different $j$, there will be a unique maximum appearance frequency, and the slowest decay is given by $a_i^* = \max_j \{ a_i(j) \}$. Multiplying the factors for each word $i$ together yields Eq. (2) of the main text. We note that in the special case where the most frequent meaning is $r$-fold degenerate, we acquire a prefactor $r$ in front of the dominant exponential decay. 
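
The Poisson average that produces the exponential decay can be verified numerically. The short Python check below sums the series $\sum_n e^{-\phi t} (\phi t)^n a^n / n!$ directly and compares it with $e^{-\phi (1 - a) t}$; the parameter values and function names are arbitrary choices for illustration.

```python
import math


def poisson_average(phi, a, t, n_max=200):
    """Sum over n of Poisson(n; phi*t) * a**n, truncated after n_max terms."""
    mean = phi * t
    total = 0.0
    term = math.exp(-mean)          # term = e^{-mean} (mean*a)^n / n! at n = 0
    for n in range(n_max):
        total += term
        term *= mean * a / (n + 1)  # advance the term from n to n + 1
    return total


def closed_form(phi, a, t):
    """Exponential decay with rate phi * (1 - a)."""
    return math.exp(-phi * (1.0 - a) * t)


phi, t = 0.3, 20.0
for a in (0.2, 0.5, 0.8):
    print(a, poisson_average(phi, a, t), closed_form(phi, a, t))
```

The largest appearance probability gives the smallest decay rate, which is why only the most frequent confounder matters at late times.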

For the case where $a_i^*$ is the same for all words $i$, Eq. (3) in the main text is obtained by taking the logarithm of $L(t)$, replacing the sum with an integral, and expanding the logarithm to first order. For a Zipf distribution of word frequencies, $\phi_i = 1/(\mu i)$ with $\mu = \sum_{i=1}^{W} (1/i)$, this procedure yields [13] 

$$ \ln L(t) \approx - \int_1^W dx \, \exp\!\left[ - \frac{(1 - a^*) t}{\mu x} \right] \approx - \frac{\mu W^2}{(1 - a^*) t} \exp\!\left[ - \frac{(1 - a^*) t}{\mu W} \right] , \tag{A6} $$

where we have used the asymptotics of the exponential integral [27] to obtain the second approximate equality. The solution of the equation $\ln L(t) = \ln(1 - \epsilon)$ yields the learning time given by Eq. (3). This involves the Lambert $W$ function, which is defined by solutions of the equation $W(z) e^{W(z)} = z$. 



Appendix B: Learning time for the interacting RZ 
model 



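In the interacting RZ model described in the main text, it is assumed that all meanings are labeled and so Eq. (4) for the learning probability simplifies to 

$$ L(t) = \left\langle \prod_{i=1}^{W} E_i(t) \prod_{\ell} \left[ 1 - C_\ell(t) \right] \right\rangle , \tag{B1} $$

where here $C_\ell(t) = A_{i_1,i_2}(t) A_{i_2,i_3}(t) \cdots A_{i_n,i_1}(t)$ for an ordered subset $\ell = (i_1, i_2, \ldots, i_n)$ of the $W$ meanings. For the specific case of the RZ model, we argued that only the subset $\ell = (1, 2)$ actually contributes to the late-time behavior of $L(t)$. Thus we have 

$$ L(t) \sim \left\langle \left( \prod_{i=1}^{W} E_i(t) \right) \left[ 1 - A_{1,2}(t) A_{2,1}(t) \right] \right\rangle = \left( \prod_{i=1}^{W} \langle E_i(t) \rangle \right) \left[ 1 - \frac{ \langle E_1(t) A_{1,2}(t) \rangle \langle E_2(t) A_{2,1}(t) \rangle }{ \langle E_1(t) \rangle \langle E_2(t) \rangle } \right] , \tag{B2} $$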

where we have used the fact that the contexts presented alongside different words are uncorrelated. $\langle E_i(t) \rangle$ is the probability that an event governed by a Poisson process with frequency $\phi_i$ has occurred at least once by time $t$. Hence, 

$$ \prod_{i=1}^{W} \langle E_i(t) \rangle = \prod_{i=1}^{W} \left[ 1 - e^{-\phi_i t} \right] . \tag{B3} $$

This is of the same form as Eq. (2) of the main text, but with $a_i^* = 0$, and so from (A6) we have that 

$$ \prod_{i=1}^{W} \langle E_i(t) \rangle \approx \exp\!\left( - \frac{\mu W^2}{t} \, e^{-t/(\mu W)} \right) . \tag{B4} $$

This provides the first contribution to the learning probability for the lexicon in the interacting RZ model, Eq. (5) of the main text. 

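Using again the identity $[1 - E_i(t)] A_{i,j}(t) = [1 - E_i(t)]$ from the previous section, we find $E_i(t) A_{i,j}(t) = A_{i,j}(t) - [1 - E_i(t)]$. Hence, 

$$ \frac{ \langle E_i(t) A_{i,j}(t) \rangle }{ \langle E_i(t) \rangle } = \frac{ e^{-\phi_i [1 - a_i(j)] t} - e^{-\phi_i t} }{ 1 - e^{-\phi_i t} } . \tag{B5} $$

If $a_i(j)$ is close to unity, as is the case for the high-frequency meanings in the RZ model, we have at late times that 

$$ \frac{ \langle E_i(t) A_{i,j}(t) \rangle }{ \langle E_i(t) \rangle } \approx e^{-\phi_i [1 - a_i(j)] t} . \tag{B6} $$

Using this expression in (B2) in conjunction with (B4), and noting that $\phi_i = 1/(\mu i)$, we arrive at Eq. (5) in the main text. 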
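
The two time scales that compete in the interacting RZ model can be compared numerically. The Python sketch below assumes the late-time form $L(t) \approx \exp[-(\mu W^2/t) e^{-t/(\mu W)}] \, \{1 - \exp[-3(1 - a^*) t/(2\mu)]\}$, which follows from combining (B2), (B4) and (B6) for the loop $\ell = (1, 2)$ with $\phi_1 + \phi_2 = 3/(2\mu)$. Function names and parameter values are hypothetical.

```python
import math


def harmonic(W):
    """mu = sum_{i=1}^{W} 1/i, the normalization of the Zipf word frequencies."""
    return sum(1.0 / i for i in range(1, W + 1))


def learning_probability(t, W, a_star):
    """Assumed late-time form: exposure factor times loop factor."""
    mu = harmonic(W)
    exposure = math.exp(-(mu * W**2 / t) * math.exp(-t / (mu * W)))
    loop = 1.0 - math.exp(-3.0 * (1.0 - a_star) * t / (2.0 * mu))
    return exposure * loop


def loop_time(W, a_star, eps):
    """Time at which the loop (1, 2) survives with probability eps."""
    return -2.0 * harmonic(W) * math.log(eps) / (3.0 * (1.0 - a_star))


def critical_nonappearance(W):
    """Value of 1 - a* at which 3(1 - a*)/2 equals 1/W."""
    return 2.0 / (3.0 * W)


def is_maximally_efficient(W, a_star):
    """True when the loop decays faster than the last words are first heard."""
    return 1.5 * (1.0 - a_star) > 1.0 / W


print(critical_nonappearance(60000))          # roughly 1.1e-5, i.e. about 0.001%
print(is_maximally_efficient(60000, 0.999))
print(learning_probability(loop_time(100, 0.9999, 0.01), 100, 0.9999))
```

For a 60,000-word lexicon the critical non-appearance frequency comes out near $10^{-5}$, in line with the figure of about 0.001% of episodes quoted in the main text.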

For more general models, in which more than one candidate loop enters at large times, we have found that including only loops of length 2 in the product in (B1) yields very good agreement with simulation data. More precisely, numerically-determined roots of $L(t) = 1 - \epsilon$ with $L(t)$ given by the approximate expression 

$$ L(t) \approx \prod_{i=1}^{W} \left[ 1 - e^{-\phi_i t} \right] \prod_{i < j} \left[ 1 - \langle A_{i,j}(t) \rangle \langle A_{j,i}(t) \rangle \right] \tag{B7} $$

correspond well with simulation data, and furthermore provide evidence that the phase transition reported in the main text is generic, as claimed. 